In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
import dateutil.parser
from dateutil import parser
from datetime import datetime
First, I made a mistake naming the data set! It's 2015 data, not 2014 data. But yes, still use
311-2014.csv
. You can rename it.
Import your data, but only the first 200,000 rows. You'll also want to change the index to be a datetime based on the Created Date column - you'll want to check if it's already a datetime, and parse it if not.
In [30]:
complaint_df = pd.read_csv("311-2014.csv", parse_dates = True).head(200000)
In [32]:
complaint_df.dtypes
Out[32]:
In [8]:
import datetime
# complaint_df['Created Date'] = complaint_df['Created Date'].astype(dateutil.parser.parse)
In [34]:
complaint_df['Created Date'] = complaint_df['Created Date'].apply(dateutil.parser.parse)
In [36]:
complaint_df['Created Date'].describe()
Out[36]:
In [38]:
complaint_df['Created Date'].head()
Out[38]:
In [40]:
#import datetime
#dt = datetime.datetime.strptime(string_date, fmt)
In [42]:
#import numpy as np
# Build a function using that method
#str_time = "07/06/2015 10:58:27 AM"
#def time_to_datetime(str_time):
# try:
# dt = parser.parse(str_time)
# return dt
# except:
# return np.nan
#time_to_datetime(str_time)
In [44]:
# complaint_df['Created Date'] = complaint_df['Created Date'].apply(time_to_datetime)
In [46]:
complaint_df['Created Date'].describe()
Out[46]:
What was the most popular type of complaint, and how many times was it filed?
In [ ]:
complaint_df.groupby('Complaint Type')['Complaint Type'].count().sort_values(ascending=False).head(1)
Make a horizontal bar graph of the top 5 most frequent complaint types.
In [48]:
import matplotlib.pyplot as plt
%matplotlib inline
freqNYcomplaints = complaint_df.groupby('Complaint Type')['Complaint Type'].count().sort_values(ascending=False).head(5).plot(kind="bar", color = ['#624ea7', '#599ad3', '#f9a65a', '#9e66ab', 'purple'])
freqNYcomplaints.set_title('Most Frequent NYC Complaints')
freqNYcomplaints.set_xlabel('Type of Complaint')
freqNYcomplaints.set_ylabel('Frequency of Complaint')
plt.savefig('freqNYcomplaints.png')
Which borough has the most complaints per capita? Since it's only 5 boroughs, you can do the math manually.
In [ ]:
complaint_df.head()
According to your selection of data, how many cases were filed in March? How about May?
In [56]:
#confirm this is a date type
# complaint_df['Created Date']
In [ ]:
In [55]:
complaint_df.index = complaint_df['Created Date']
# complaint_df.head()
In [199]:
#complaint_df = complaint_df.set_index(['Created Date'])
#complaint_df['Created Date'] = pd.to_datetime(complaint_df['Created Date'])
ax = complaint_df['2015-03'].resample('D').count().plot(y='Unique Key')
# complaint_df['2015-04']['Complaint Type'].count()
ax.set_title('NYC Complaints in March 2015')
ax.set_xlabel('Day of the Month')
ax.set_ylabel('Number of Complaints')
plt.savefig('MarchfreqNYcomplaints.png')
In [200]:
ax = complaint_df['2015-05'].resample('D').count().plot(y='Unique Key')
ax.set_title('NYC Complaints in May 2015')
ax.set_xlabel('Day of the Month')
ax.set_ylabel('Number of Complaints')
plt.savefig('MayfreqNYcomplaints.png')
In [58]:
complaint_df.head()
Out[58]:
I'd like to see all of the 311 complaints called in on April 1st.
Surprise! We couldn't do this in class, but it was just a limitation of our data set
In [67]:
print("There were", complaint_df['2015-04-01']['Unique Key'].count(), "complaints on April 1st.")
What was the most popular type of complaint on April 1st?
What were the most popular three types of complaint on April 1st
In [71]:
complaint_df['2015-04-01'].groupby('Complaint Type')['Complaint Type'].count().sort_values(ascending=False).head(1)
Out[71]:
In [74]:
complaint_df['2015-04-01'].groupby('Complaint Type')['Complaint Type'].count().sort_values(ascending=False).head(3)
Out[74]:
What month has the most reports filed? How many? Graph it.
In [77]:
complaint_df.resample('M')['Complaint Type'].count().sort_values(ascending=False).head(1)
Out[77]:
In [201]:
ax = complaint_df['2015-05'].resample('D').count().plot(y='Unique Key')
ax.set_title('NYC Complaints in May 2015')
ax.set_xlabel('Day of the Month')
ax.set_ylabel('Number of Complaints')
plt.savefig('MayfreqNYcomplaints.png')
What week of the year has the most reports filed? How many? Graph the weekly complaints.
In [97]:
complaint_df.resample('W')['Complaint Type'].count().sort_values(ascending=False).head(3)
Out[97]:
In [102]:
complaint_df['2015-05-10'].resample('W')['Complaint Type'].count().plot(y='Unique Key')
Out[102]:
Noise complaints are a big deal. Use .str.contains
to select noise complaints, and make an chart of when they show up annually. Then make a chart about when they show up every day (cyclic).
In [202]:
ax = complaint_df['Complaint Type'].str.contains('Noise').resample('W').count().plot(y='Unique Key')
ax.set_title('NYC Complaints in 2015')
ax.set_xlabel('Month')
ax.set_ylabel('Number of Complaints')
plt.savefig('AnnualfreqNYcomplaints.png')
In [210]:
ax = complaint_df['Complaint Type'].str.contains('Noise').groupby(by=complaint_df.index.hour).mean().plot(y='Unique Key')
ax.set_title('Daily NYC SoundComplaints in 2015')
ax.set_xlabel('Number of Complaints')
plt.savefig('DailyfreqNYSoundcomplaints.png')
#ax.set_xticks([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23])
#ax.set_xticklabels([12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
In [ ]:
Which were the top five days of the year for filing complaints? How many on each of those days? Graph it.
In [124]:
complaint_df.resample('W')['Complaint Type'].count().sort_values(ascending=False).head(5)
Out[124]:
In [176]:
#may_17 = complaint_df['2015-05-17']
#ax = may_17.groupby(by=may_17.index.hour).count().plot(y='Unique Key')
In [183]:
may_17 = complaint_df['2015-05-17']
ax = may_17.groupby(by=may_17.index.hour).count().plot(y='Unique Key', label="May 17th")
may_24 = complaint_df['2015-05-24']
may_24.groupby(by=may_24.index.hour).count().plot(y='Unique Key', ax=ax, label = "May 24th")
may_03 = complaint_df['2015-05-03']
may_03.groupby(by=may_03.index.hour).count().plot(y='Unique Key', ax=ax, label = "May 3rd")
may_10 = complaint_df['2015-05-10']
may_10.groupby(by=may_10.index.hour).count().plot(y='Unique Key', ax=ax, label = "May 10th")
may_31 = complaint_df['2015-05-31']
may_31.groupby(by=may_31.index.hour).count().plot(y='Unique Key', ax=ax, label = "May 31st")
ax.set_xticks([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23])
ax.set_xticklabels([12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
Out[183]:
In [174]:
ax
Out[174]:
In [182]:
# ax = complaint_df['2015-05-10'].resample('H').count().plot(y='Unique Key')
#ax = complaint_df['2015-05-17'].groupby(by=complaint_df.index.hour).count().plot(y='Unique Key', label = "May 17th")
# complaint_df['2015-05-24'].resample('H')['Complaint Type'].count().plot(y='Unique Key', ax=ax, label = "May 24th")
# complaint_df['2015-05-03'].resample('H')['Complaint Type'].count().plot(y='Unique Key', ax=ax, label = "May 3rd")
# complaint_df['2015-05-31'].resample('H')['Complaint Type'].count().plot(y='Unique Key', ax=ax, label = "May 31st")
# # the least sold in a month (e.g., the worst January ever , the worst July ever, etc)
# complaint_df['2015-05-17'].resample('H')['Complaint Type'].count().plot(y='Unique Key', ax=ax, label = 'May 17th')
# the most sold in a month (e.g., the best January ever , the best July ever, etc)
#df.groupby(by=df.index.month).max().plot(y='val', ax=ax, label = 'Most Sold')
What hour of the day are the most complaints? Graph a day of complaints.
In [197]:
#make a graph of the average
ax = complaint_df.groupby(by=complaint_df.index.hour).mean().plot(y='Unique Key', label='Average')
#x_values = complaint_df.groupby(by=complaint_df.index.hour).median().index
# get the minimum
#min_values = complaint_df.groupby(by=complaint_df.index.hour)['Unique Key'].min()
#get maxs
#max_values = complaint_df.groupby(by=complaint_df.index.hour)['Unique Key'].max()
#ax.fill_between(x_values, min_values, max_values, alpha=0.3)
#ax.set_title("Fluctuation of House Sales")
In [192]:
complaint_df.groupby(by=complaint_df.index.hour)['Unique Key'].median()
Out[192]:
One of the hours has an odd number of complaints. What are the most common complaints at that hour, and what are the most common complaints the hour before and after?
In [214]:
ax = complaint_df['Complaint Type'].str.contains('04:').resample('T').count().plot(y='Unique Key')
In [ ]:
In [ ]:
So odd. What's the per-minute breakdown of complaints between 12am and 1am? You don't need to include 1am.
In [ ]:
Looks like midnight is a little bit of an outlier. Why might that be? Take the 5 most common agencies and graph the times they file reports at (all day, not just midnight).
In [ ]:
Graph those same agencies on an annual basis - make it weekly. When do people like to complain? When does the NYPD have an odd number of complaints?
In [ ]:
Maybe the NYPD deals with different issues at different times? Check the most popular complaints in July and August vs the month of May. Also check the most common complaints for the Housing Preservation Bureau (HPD) in winter vs. summer.
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]: